Abstract

Deep learning techniques have been successfully applied
in many areas of computer vision, including low-level
image restoration problems. For image super-resolution,
several models based on deep neural networks have been
recently proposed and attained superior performance that
overshadows all previous handcrafted models. The question
then arises whether large-capacity and data-driven models
have become the dominant solution to the ill-posed
super-resolution problem. In this paper, we argue that domain
expertise represented by the conventional sparse coding
model is still valuable, and it can be combined with the key
ingredients of deep learning to achieve further improved
results. We show that a sparse coding model particularly
designed for super-resolution can be incarnated as a neural
network, and trained in a cascaded structure from end to
end. The interpretation of the network based on sparse
coding leads to much more efficient and effective training,
as well as a reduced model size. Our model is evaluated on
a wide range of images, and shows a clear advantage over
existing state-of-the-art methods in terms of both restoration
accuracy and human subjective quality.

1. Introduction 

Single image super-resolution (SR) aims at obtaining 
a high-resolution (HR) image from a low-resolution (LR) 
input image by inferring all the missing high frequency 
contents. With the known variables in LR images greatly 
outnumbered by the unknowns in HR images, SR is a highly 
ill-posed problem and the current techniques are far from 
being satisfactory for many real applications [?].

To regularize the solution of SR, people have exploited
various priors of natural images. Analytical priors, such as
bicubic interpolation, work well for smooth regions, while
image models based on statistics of edges [?] and gradients
[?] can recover sharper structures. In the patch-based
SR methods, HR patch candidates are represented as the
sparse linear combination of dictionary atoms trained from
external databases [36, 35], or recovered from similar
examples in the LR image itself at different locations and across
different scales [13, 12, 32]. A comprehensive review of
more SR methods can be found in [?].

More recently, inspired by the great success achieved by
deep learning [?, 27, 30] in other computer vision tasks,
researchers have begun to use neural networks with deep
architectures for image SR. Multiple layers of collaborative
auto-encoders are stacked together in [6] for robust matching of
self-similar patches. Deep convolutional neural networks
(CNN) [8] and deconvolutional networks [25] are designed
to directly learn the non-linear mapping from LR space
to HR space in a way similar to coupled sparse coding
[?]. As these deep networks allow end-to-end training
of all the model components between LR input and HR
output, significant improvements have been observed over
their shallow counterparts.

The networks in [?] are built with generic architectures,
which means all their knowledge about SR is learned
from training data. On the other hand, people's domain
expertise for the SR problem, such as natural image priors
and image degradation models, is largely ignored in deep
learning based approaches. It is therefore worth investigating
whether domain expertise can be used to design better
deep model architectures, or whether deep learning can be
leveraged to improve the quality of handcrafted models.

In this paper, we extend the conventional sparse coding
model [36] using several key ideas from deep learning,
and show that domain expertise is complementary to large
learning capacity in further improving SR performance.
First, based on the learned iterative shrinkage and thresholding
algorithm (LISTA) [14], we implement a feed-forward
neural network whose layers strictly correspond to each step
in the processing flow of sparse coding based image SR.
In this way, the sparse representation prior is effectively
encoded in our network structure; at the same time, all
the components of sparse coding can be trained jointly
through back-propagation. This simple model, which is
named sparse coding based network (SCN), achieves notable
improvement over the generic CNN model in terms
of both recovery accuracy and human perception, and yet
has a compact model size. Moreover, with the correct
understanding of each layer's physical meaning, we have
a more principled way to initialize the parameters of SCN,
which helps to improve optimization speed and quality.

A single network is only able to perform image SR by
a particular scaling factor. In [?], different networks are
trained for different scaling factors. In this paper, we also
propose a cascade of multiple SCNs to achieve SR for
arbitrary factors. This simple approach, motivated by the
self-similarity based SR approach [?], not only increases
the scaling flexibility of our model, but also reduces artifacts
for large scaling factors. The cascade of SCNs (CSCN) can
also benefit from the end-to-end training of deep network
with a specially designed multi-scale cost function.

In short, the contributions of this paper include: 

• combine the domain expertise of sparse coding and the
merits of deep learning to achieve better SR performance
with faster training and smaller model size;

• use network cascading for large and arbitrary scaling
factors;

• conduct a subjective evaluation on several recent
state-of-the-art methods.

In the following, we will first review related work in
Sec. 2. The SCN and CSCN models are introduced in Sec. 3
and Sec. 4, with implementation details in Sec. 5. Extensive
experimental results are reported in Sec. 6, and conclusions
are drawn in Sec. 7.

2. Related Work 

2.1. Image SR Using Sparse Coding 

The sparse representation based SR method [36] models
the transform from each local patch $y \in \mathbb{R}^{m_y}$ in
the bicubic-upscaled LR image to the corresponding patch
$x \in \mathbb{R}^{m_x}$ in the HR image. The dimension $m_y$ is not
necessarily the same as $m_x$ when image features other than
raw pixels are used to represent patch $y$. It is assumed that
the LR (HR) patch $y$ ($x$) can be represented with respect
to an overcomplete dictionary $D_y$ ($D_x$) using some sparse
linear coefficients $\alpha_y$ ($\alpha_x$) $\in \mathbb{R}^n$, which are known as
the sparse code. Since the degradation process from $x$ to $y$ is
nearly linear, the patch pair can share the same sparse code
$\alpha_y = \alpha_x = \alpha$ if the dictionaries $D_y$ and $D_x$ are defined
properly. Therefore, for an input LR patch $y$, the HR patch
can be recovered as

$$x = D_x \alpha, \quad \text{s.t.} \quad \alpha = \arg\min_z \|y - D_y z\|_2^2 + \lambda \|z\|_1, \tag{1}$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm, which is convex and
sparsity-inducing, and $\lambda$ is a regularization coefficient. The
dictionary pair $(D_y, D_x)$ can be learned alternately with
the inference of training patches' sparse codes in their joint
space [36] or through bi-level optimization [35].
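The recovery rule in Eq. (1) can be illustrated with a minimal sketch: the LR patch is sparse-coded with plain ISTA and the code is then decoded with the HR dictionary. The function names, the toy dictionaries and the values of `lam` and `L` below are illustrative assumptions, not the settings or the solver used in [36].

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def soft_threshold(v, t):
    """Element-wise shrinkage: sign(a) * max(|a| - t, 0)."""
    return [(1.0 if a > 0 else -1.0) * max(abs(a) - t, 0.0) for a in v]


def ista(y, D_y, lam, L, n_iter=100):
    """Approximate argmin_z ||y - D_y z||_2^2 + lam * ||z||_1.

    L is assumed to upper-bound the largest eigenvalue of D_y^T D_y.
    """
    D_y_T = [list(col) for col in zip(*D_y)]
    z = [0.0] * len(D_y[0])
    for _ in range(n_iter):
        residual = [a - b for a, b in zip(y, matvec(D_y, z))]
        # Gradient step on the quadratic term with step size 1 / (2L).
        step = [zi + g / L for zi, g in zip(z, matvec(D_y_T, residual))]
        z = soft_threshold(step, lam / (2.0 * L))
    return z


def recover_hr_patch(y, D_y, D_x, lam, L):
    """Eq. (1): sparse-code the LR patch, then decode with the HR dictionary."""
    alpha = ista(y, D_y, lam, L)
    return matvec(D_x, alpha), alpha


# Toy dictionaries (hypothetical values, far smaller than real ones).
D_y = [[1.0, 0.0], [0.0, 1.0]]              # m_y = 2, n = 2
D_x = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # m_x = 3, n = 2
x, alpha = recover_hr_patch([1.0, 0.1], D_y, D_x, lam=0.4, L=1.0)
```

In this toy case the small second coefficient falls below the threshold and is set to zero, which is the sparsity-inducing effect of the $\ell_1$ term.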



Figure 1. A LISTA network [14] with 2 time-unfolded recurrent
stages, whose output $\alpha$ is an approximation of the sparse code
of input signal $y$. The linear weights $W$, $S$ and the shrinkage
thresholds $\theta$ are learned from data.


2.2. Network Implementation of Sparse Coding 

There is an intimate connection between sparse coding
and neural networks, which has been well studied in
[?]. A feed-forward neural network as illustrated in Fig. 1 is
proposed in [14] to efficiently approximate the sparse code
$\alpha$ of input signal $y$ as it would be obtained by solving (1)
for a given dictionary $D_y$. The network has a finite number
of recurrent stages, each of which updates the intermediate
sparse code according to

$$z_{k+1} = h_\theta(W y + S z_k), \tag{2}$$

where $h_\theta$ is an element-wise shrinkage function defined as
$[h_\theta(a)]_i = \mathrm{sign}(a_i)(|a_i| - \theta_i)_+$ with positive thresholds $\theta$.

Different from the iterative shrinkage and thresholding
algorithm (ISTA) [7, 26], which finds an analytical
relationship between network parameters (weights $W$, $S$ and
thresholds $\theta$) and sparse coding parameters ($D_y$ and $\lambda$),
the authors of [14] learn all the network parameters from
training data using a back-propagation algorithm called
learned ISTA (LISTA). In this way, a good approximation
of the underlying sparse code can be obtained within a fixed
number of recurrent stages.
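The recurrence in Eq. (2) can be sketched as a short forward pass. This is only an illustration under stated assumptions: the code starts from $z_0 = 0$ and counts each application of Eq. (2) as one stage, and the weights in the usage lines are made-up numbers rather than learned values.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def shrink(a, theta):
    """Element-wise shrinkage h_theta with one threshold per coordinate."""
    return [(1.0 if ai > 0 else -1.0) * max(abs(ai) - ti, 0.0)
            for ai, ti in zip(a, theta)]


def lista_forward(y, W, S, theta, num_stages):
    """Apply z_{k+1} = h_theta(W y + S z_k) num_stages times from z_0 = 0."""
    Wy = matvec(W, y)                      # computed once, reused by every stage
    z = [0.0] * len(Wy)
    for _ in range(num_stages):
        pre_activation = [w + s for w, s in zip(Wy, matvec(S, z))]
        z = shrink(pre_activation, theta)
    return z


# Hypothetical weights for a 2-dimensional code.
W = [[1.0, 0.0], [0.0, 1.0]]
S = [[0.0, 0.5], [0.5, 0.0]]
theta = [0.5, 0.5]
z1 = lista_forward([2.0, -1.0], W, S, theta, num_stages=1)
z2 = lista_forward([2.0, -1.0], W, S, theta, num_stages=2)
```

Because the stages share `W`, `S` and `theta`, unrolling more stages adds computation but no parameters.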

3. Sparse Coding based Network for Image SR 

Given the fact that sparse coding can be effectively
implemented with a LISTA network, it is straightforward to
build a multi-layer neural network that mimics the processing
flow of the sparse coding based SR method [36]. As in
most patch-based SR methods, our sparse coding based
network (SCN) takes the bicubic-upscaled LR image $I_y$ as
input, and outputs the full HR image $I_x$. Fig. 2 shows the
main network structure, and each of the layers is described
in the following.

The input image $I_y$ first goes through a convolutional
layer $H$ which extracts features for each LR patch. There
are $m_y$ filters of spatial size $s_y \times s_y$ in this layer, so that our
input patch size is $s_y \times s_y$ and its feature representation $y$
has $m_y$ dimensions.

Figure 2. Top left: the proposed SCN model with a patch extraction layer $H$, a LISTA sub-network for sparse coding (with $k$ recurrent
stages denoted by the dashed box), a HR patch recovery layer $D_x$, and a patch combination layer $G$. Top right: a neuron with an
adjustable threshold decomposed into two linear scaling layers and a unit-threshold neuron. Bottom: the SCN re-organized with
unit-threshold neurons and adjacent linear layers merged together in the gray boxes.

Each LR patch $y$ is then fed into a LISTA network with
a finite number of $k$ recurrent stages to obtain its sparse
code $\alpha \in \mathbb{R}^n$. Each stage of LISTA consists of two linear
layers parameterized by $W \in \mathbb{R}^{n \times m_y}$ and $S \in \mathbb{R}^{n \times n}$,
and a nonlinear neuron layer with activation function $h_\theta$.
The activation thresholds $\theta \in \mathbb{R}^n$ are also to be updated
during training, which complicates the learning algorithm.
To restrict all the tunable parameters to our linear layers, we
use a simple trick to rewrite the activation function as

$$[h_\theta(a)]_i = \mathrm{sign}(a_i)\,\theta_i\,(|a_i|/\theta_i - 1)_+ = \theta_i\, h_1(a_i/\theta_i). \tag{3}$$

Eq. (3) indicates that the original neuron with an adjustable
threshold can be decomposed into two linear scaling layers
and a unit-threshold neuron, as shown in the top-right of
Fig. 2. The weights of the two scaling layers are diagonal
matrices defined by $\theta$ and its element-wise reciprocal,
respectively.
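The identity in Eq. (3) is easy to check numerically. The sketch below compares the adjustable-threshold neuron with its decomposition into a scaling by $1/\theta_i$, a unit-threshold neuron $h_1$, and a scaling by $\theta_i$; the function names and test values are illustrative only.

```python
def shrink(a, theta):
    """Adjustable-threshold neuron: sign(a_i) * max(|a_i| - theta_i, 0)."""
    return [(1.0 if ai > 0 else -1.0) * max(abs(ai) - ti, 0.0)
            for ai, ti in zip(a, theta)]


def unit_shrink(v):
    """Unit-threshold neuron h_1 applied to a scalar."""
    return (1.0 if v > 0 else -1.0) * max(abs(v) - 1.0, 0.0)


def shrink_decomposed(a, theta):
    """Scale by 1/theta_i, apply h_1, then scale back by theta_i (Eq. 3)."""
    return [ti * unit_shrink(ai / ti) for ai, ti in zip(a, theta)]


a = [3.0, -0.4, 1.0, -5.0]
theta = [0.5, 1.0, 2.0, 0.25]
direct = shrink(a, theta)
decomposed = shrink_decomposed(a, theta)
```

Since both scalings are diagonal linear maps, they can be absorbed into the neighbouring linear layers, which is what allows all trainable parameters to sit in linear layers.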

The sparse code $\alpha$ is then multiplied with the HR dictionary
$D_x \in \mathbb{R}^{m_x \times n}$ in the next linear layer, reconstructing the HR
patch $x$ of size $s_x \times s_x = m_x$.

In the final layer $G$, all the recovered patches are put
back to the corresponding positions in the HR image $I_x$.
This is realized via a convolutional filter of $m_x$ channels
with spatial size $s_g \times s_g$. The size $s_g$ is determined as
the number of neighboring patches that overlap with the
same pixel in each spatial direction. The filter will assign
appropriate weights to the overlapped recoveries from
different patches and take their weighted average as the final
prediction in $I_x$.
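The role of the combination layer can be sketched in one dimension: every output pixel is a weighted average of the values that overlapping patches predict for it. This is a simplified illustration, not the authors' layer. It assumes stride-1 patches, normalizes by the actual coverage at each pixel (including borders), and defaults to uniform weights, whereas in the network the weights are learned.

```python
def aggregate_patches(patches, weights=None):
    """Combine stride-1 overlapping 1-D patches by weighted averaging.

    patches[j] is the recovered patch whose first pixel sits at position j.
    weights[o] is the weight given to offset o inside a patch.
    """
    size = len(patches[0])
    weights = weights or [1.0] * size
    length = len(patches) + size - 1
    total = [0.0] * length
    norm = [0.0] * length
    for start, patch in enumerate(patches):
        for offset, value in enumerate(patch):
            total[start + offset] += weights[offset] * value
            norm[start + offset] += weights[offset]
    return [t / n for t, n in zip(total, norm)]


# Three overlapping patches cut from the signal [1, 2, 3, 4, 5].
patches = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
signal = aggregate_patches(patches)
```

When the patches agree on every shared pixel, as here, the average reproduces the underlying signal; when they disagree, the weights decide how the conflicting predictions are blended.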

As illustrated in the bottom of Fig. 2, after some simple
reorganizations of the layer connections, the network
described above has some adjacent linear layers which can
be merged into a single layer. This helps to reduce the
computation load as well as redundant parameters in the
network. The layers $H$ and $G$ are not merged because
we apply additional nonlinear normalization operations on
patches $y$ and $x$, which will be detailed in Sec. 5.

Thus, there are a total of 5 trainable layers in our network:
2 convolutional layers $H$ and $G$, and 3 linear layers shown
as gray boxes in Fig. 2. The $k$ recurrent layers share the
same weights and are therefore conceptually regarded as
one. Note that all the linear layers are actually implemented
as convolutional layers applied on each patch with filter
spatial size of 1×1, a structure similar to the network in
network [20]. Also note that all these layers have only
weights but no biases (zero biases).

Mean square error (MSE) is employed as the cost function
to train the network, and our optimization objective can
be expressed as

$$\min_{\Theta} \sum_i \left\| SCN(I_y^{(i)}; \Theta) - I_x^{(i)} \right\|_2^2, \tag{4}$$

where $I_y^{(i)}$ and $I_x^{(i)}$ are the $i$-th pair of LR/HR training data,
and $SCN(I_y; \Theta)$ denotes the HR image for $I_y$ predicted
using the SCN model with parameter set $\Theta$. All the
parameters are optimized through the standard back-propagation
algorithm. Although it is possible to use other cost terms
that are more correlated with human visual perception than
MSE, our experimental results show that simply minimizing
MSE leads to improvement in subjective quality.

Advantages over Previous Models 

The construction of our SCN follows exactly each step
in the sparse coding based SR method [36]. If the network
parameters are set according to the dictionaries learned in
[36], it can reproduce almost the same results. However,
after training, SCN learns a more complex regression
function and can no longer be converted to an equivalent sparse
coding model. The advantage of SCN comes from its ability
to jointly optimize all the layer parameters from end to end,
while in [36] some variables are manually designed and
some are optimized individually by fixing all the others.

Technically, our network is also a CNN and it has layers
similar to those of the CNN model proposed in [8] for patch
extraction and reconstruction. The key difference is that we have a
LISTA sub-network specifically designed to enforce the sparse
representation prior, while in [8] a generic rectified linear
unit (ReLU) [?] is used for nonlinear mapping. Since
SCN is designed based on our domain knowledge in sparse
coding, we are able to obtain a better interpretation of the
filter responses and have a better way to initialize the filter
parameters in training. We will see in the experiments that
all these contribute to better SR results, faster training speed
and smaller model size than a vanilla CNN.


4. Network Cascade for Scalable SR 


Like most SR models learned from external training
examples, the SCN discussed previously can only upscale
images by a fixed factor. A separate model needs to be trained
for each scaling factor to achieve the best performance,
which limits the flexibility and scalability in practical use.
One way to overcome this difficulty is to repeatedly enlarge
the image by a fixed scale until the resulting HR image
reaches a desired size. This practice is commonly adopted
in the self-similarity based methods [?], but is not
so popular in other cases for fear of error accumulation
during repetitive upscaling.

In our case, however, it is observed that a cascade of
SCNs (CSCN) trained for small scaling factors can generate
even better SR results than a single SCN trained for a large
scaling factor, especially when the target scaling factor is
large (greater than 2). This is illustrated by the example
in Fig. 3. Here an input image is magnified by ×4
in two ways: with a single SCN×4 model through the
processing flow (a) → (b) → (d); and with a cascade of two
SCN×2 models through (a) → (c) → (e). It can be seen that
the input to the second cascaded SCN×2 in (c) is already
sharper and contains fewer artifacts than the bicubic×4 input
to the single SCN×4 in (b), which naturally leads to a
better final result in (e) than the one in (d). Therefore,
each SCN in the cascade serves as a “relaying station”
which progressively recovers some useful information lost
in bicubic interpolation and compensates for the distortion
aggregated from previous stages.

Figure 3. SR results for the “Lena” image upscaled by 4 times. (a)
→ (b) → (d) represents the processing flow with a single SCN×4
model. (a) → (c) → (e) represents the processing flow with two
cascaded SCN×2 models. PSNR is given in parentheses:
(b) bicubic×4 (28.52); (d) SCN×4 (30.22); (e) SCN×2 & SCN×2 (30.72).

Figure 4. Training cascade of SCNs with multi-scale objectives.

The CSCN is also a deep network, in which the output
of each SCN is connected to the input of the next SCN
with bicubic interpolation in between. To construct the
cascade, besides stacking several SCNs trained individually
with respect to (4), we can also optimize all of them jointly
as shown in Fig. 4. Without loss of generality, we assume
each SCN in the cascade has the same scaling factor $s$. Let
$\hat{I}_0$ denote the input image of original size, and $\hat{I}_j$ ($j>0$)
denote the output image of the $j$-th SCN upscaled by a total
of $\times s^j$ times. Each $\hat{I}_j$ can be compared with its associated
ground truth image $I_j$ according to the MSE cost, leading
to a multi-scale objective function:

$$\min_{\{\Theta_j\}} \sum_i \sum_j \left\| SCN(\hat{I}^{(i)}_{j-1 \uparrow s}; \Theta_j) - I^{(i)}_j \right\|_2^2, \tag{5}$$

where $i$ denotes the data index, and $j$ denotes the SCN
index. $I_{\uparrow s}$ is the bicubic interpolated image of $I$ by a
factor of $s$. This multi-scale objective function makes full
use of the supervision information in all scales, sharing a
similar idea with heterogeneous networks [?]. All the layer
parameters $\{\Theta_j\}$ in (5) could be optimized from end to
end by back-propagation. We use a greedy algorithm here
to train each SCN sequentially from the beginning of the
cascade so that we do not need to care about the gradient
of bicubic layers. Applying back-propagation through a
bicubic layer or its trainable surrogate will be considered
in future work.
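The structure of the multi-scale objective in (5) can be sketched for a single training sample. This is a toy illustration under loud assumptions: signals are 1-D lists, nearest-neighbour repetition stands in for bicubic interpolation, and each stage is an arbitrary callable rather than a trained SCN.

```python
def upsample(signal, s):
    """Nearest-neighbour upscaling of a 1-D signal (stand-in for bicubic)."""
    return [v for v in signal for _ in range(s)]


def squared_error(a, b):
    """Squared L2 distance between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def multiscale_loss(lr, targets, stages, s=2):
    """Sum the per-stage errors of a cascade for one training sample.

    stages[j] plays the role of the (j+1)-th SCN and targets[j] is the
    ground truth at the scale reached after that stage.
    """
    loss = 0.0
    current = lr
    for stage, target in zip(stages, targets):
        current = stage(upsample(current, s))   # interpolate, then refine
        loss += squared_error(current, target)  # supervise every scale
    return loss


identity = lambda v: v  # placeholder for a trained SCN
targets = [[1.0, 1.0, 2.0, 2.0], [1.0] * 4 + [2.0] * 4]
perfect = multiscale_loss([1.0, 2.0], targets, [identity, identity])
off_by_one = multiscale_loss(
    [1.0, 2.0], [[1.0, 1.0, 2.0, 3.0], targets[1]], [identity, identity])
```

Every intermediate output contributes its own error term, so supervision is applied at each scale rather than only at the final one.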


5. Implementation Details 

We determine the number of nodes in each layer of our
SCN mainly according to the corresponding settings used
in sparse coding [35]. Unless otherwise stated, we use
input LR patch size $s_y$=9, LR feature dimension $m_y$=100,
dictionary size $n$=128, output HR patch size $s_x$, and
patch aggregation filter size $s_g$=5. All the convolution
layers have a stride of 1. Each LR patch $y$ is normalized
by its mean and variance, and the same mean and variance
are used to restore the final HR patch $x$. We crop 56×56
regions from each image to obtain fixed-sized input samples
to the network, which produces outputs of size 44×44.
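The 56×56 to 44×44 relation follows from the two spatial filters, as the short calculation below shows. It assumes unpadded ("valid") convolutions with stride 1 and that the 1×1 linear layers leave the spatial size unchanged; the helper names are illustrative.

```python
def valid_output_size(size, kernel, stride=1):
    """Spatial size after an unpadded convolution."""
    return (size - kernel) // stride + 1


def scn_output_size(input_size, s_y=9, s_g=5):
    """Size after patch extraction (H) and patch combination (G)."""
    after_h = valid_output_size(input_size, s_y)   # 56 -> 48 for s_y = 9
    return valid_output_size(after_h, s_g)         # 48 -> 44 for s_g = 5


output_size = scn_output_size(56)
```

Each layer trims `kernel - 1` pixels in each spatial direction, so the total shrinkage is (9 - 1) + (5 - 1) = 12 pixels.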

To reduce the number of parameters, we implement
the LR patch extraction layer $H$ as the combination of
two layers: the first layer has 4 trainable filters, each of
which is shifted to 25 fixed positions by the second layer.
Similarly, the patch combination layer $G$ is also split into
a fixed layer which aligns pixels in overlapping patches
and a trainable layer whose weights are used to combine
overlapping pixels. In this way, the number of parameters
in these two layers is reduced by more than an order of
magnitude, and there is no observable loss in performance.

We employ a standard stochastic gradient descent
algorithm to train our networks with mini-batch size of 64.
Based on the understanding of each layer's role in sparse
coding, we use Haar-like gradient filters to initialize layer
$H$, and use uniform weights to initialize layer $G$. All the
remaining three linear layers are related to the dictionary
pair $(D_x, D_y)$ in sparse coding. To initialize them, we first
randomly set $D_x$ and $D_y$ with Gaussian noise, and then
find the corresponding layer weights as in ISTA [7]:

$$w_1 = C \cdot D_y^T, \quad w_2 = I - D_y^T D_y, \quad w_3 = (CL)^{-1} \cdot D_x, \tag{6}$$

where $w_1$, $w_2$ and $w_3$ denote the weights of the three
subsequent layers after layer $H$. $L$ is the upper bound on
the largest eigenvalue of $D_y^T D_y$, and $C$ is the threshold
value before normalization. We empirically set $L$=$C$=5.
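The initialization in Eq. (6) amounts to a few matrix products, sketched below in plain Python. The function name, the noise scale `sigma`, the seed and the tiny dimensions in the usage lines are assumptions made for illustration; the source only states that the dictionaries are drawn from Gaussian noise and that $L$=$C$=5.

```python
import random


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    B_T = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in B_T] for row in A]


def init_scn_linear_layers(m_y, m_x, n, C=5.0, L=5.0, sigma=1.0, seed=0):
    """Build w1, w2, w3 of Eq. (6) from random Gaussian dictionaries."""
    rng = random.Random(seed)
    D_y = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(m_y)]
    D_x = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(m_x)]
    D_y_T = transpose(D_y)
    gram = matmul(D_y_T, D_y)                               # D_y^T D_y, n x n
    w1 = [[C * v for v in row] for row in D_y_T]            # n x m_y
    w2 = [[(1.0 if i == j else 0.0) - gram[i][j] for j in range(n)]
          for i in range(n)]                                # n x n
    w3 = [[v / (C * L) for v in row] for row in D_x]        # m_x x n
    return w1, w2, w3


# Tiny hypothetical sizes; the paper's defaults are m_y = 100 and n = 128.
w1, w2, w3 = init_scn_linear_layers(m_y=4, m_x=3, n=6)
```

The shapes follow the layer order after $H$: `w1` maps LR features to the code space, `w2` acts within the code space, and `w3` maps the code to the HR patch.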

The proposed models are all trained using the CUDA
ConvNet package [?] on a workstation with 12 Intel Xeon
2.67GHz CPUs and 1 GTX680 GPU. Training an SCN
usually takes less than one day. Note that this package is
customized for classification networks, and its efficiency
can be further optimized for our SCN model.

In testing, to make the entire image covered by output
samples, we crop input samples with overlap and extend
the boundary of the original image by reflection. Note we
shave the image border in the same way as for objective
evaluations to ensure fair comparison. Only the luminance
channel is processed with our method, and bicubic
interpolation is applied to the chrominance channels. To
achieve arbitrary upscaling factors using CSCN, we upscale
an image by ×2 repeatedly until it is at least as large
as the desired size. Then a bicubic interpolation is used to
downscale it to the target resolution if necessary.
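The rule for arbitrary factors reduces to counting how many ×2 passes are needed and what final resize ratio remains. The helper below is a small sketch of that bookkeeping; its name and return convention are illustrative.

```python
def cascade_plan(target_factor, stage_factor=2.0):
    """Return (number of cascade passes, final resize ratio).

    The image is enlarged by stage_factor until it is at least as large as
    requested; a ratio below 1 means a final bicubic downscale is needed.
    """
    passes = 0
    reached = 1.0
    while reached < target_factor:
        reached *= stage_factor
        passes += 1
    return passes, target_factor / reached


plan_x3 = cascade_plan(3.0)
plan_x4 = cascade_plan(4.0)
plan_x1_5 = cascade_plan(1.5)
```

For example, a ×3 target takes two ×2 passes (reaching ×4) followed by a downscale to 75% of that size, while a ×4 target needs no final resize.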

Figure 5. The four learned filters in the first layer $H$.

Figure 6. The PSNR change for ×2 SR on Set5 during training
using different methods: SCN; SCN with random initialization;
CNN. The horizontal dashed lines show the benchmarks of bicubic
interpolation and sparse coding (SC).

When reporting our best results in Sec. 6.2, we also
use the multi-view testing strategy commonly employed in
image classification. For patch-based image SR, multi-view
testing is implicitly used when predictions from multiple
overlapping patches are averaged. Here, besides sampling
overlapping patches, we also add more views by flipping
and transposing the patch. Such a strategy is found to
improve SR performance for general algorithms purely at the
cost of extra computation.
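Multi-view testing can be sketched as: transform the input, run the model, undo the transform on the output, and average over views. The exact set of views used by the authors is not stated, so the sketch assumes four self-inverse transforms (identity, horizontal flip, vertical flip, transpose); `model` is any callable on a 2-D list and is a placeholder for an SR network.

```python
def identity(img):
    return img


def hflip(img):
    return [row[::-1] for row in img]


def vflip(img):
    return img[::-1]


def transpose(img):
    return [list(col) for col in zip(*img)]


def multi_view_predict(model, img, views=(identity, hflip, vflip, transpose)):
    """Average model outputs over views; each view must be its own inverse."""
    outputs = [view(model(view(img))) for view in views]
    height, width = len(outputs[0]), len(outputs[0][0])
    return [[sum(out[r][c] for out in outputs) / len(outputs)
             for c in range(width)]
            for r in range(height)]


# Placeholder model that doubles every pixel (not a real SR network).
double = lambda img: [[2.0 * v for v in row] for row in img]
averaged = multi_view_predict(double, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

A model that commutes with all the transforms, like the placeholder here, gains nothing from the averaging; any benefit comes from models whose predictions differ between views.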

6. Experiments 

We evaluate and compare the performance of our models
using the same data and protocols as in [?], which are
commonly adopted in SR literature. All our models are
learned from a training set with 91 images, and tested on
Set5 [?], Set14 [37] and BSD100 [?], which contain 5,
14 and 100 images respectively. We have also trained on
a different larger data set, and observe little performance
change (less than 0.1 dB). The original images are downsized
by bicubic interpolation to generate LR-HR image
pairs for both training and evaluation. The training data are
augmented with translation, rotation and scaling.

6.1. Algorithm Analysis 

We first visualize the four filters learned in the first layer
$H$ in Fig. 5. The filter patterns do not change much from
the initial first and second order gradient operators. Some
additional small coefficients are introduced in a highly
structured form that captures richer high frequency details.

Figure 7. PSNR for ×2 SR on Set5 using SCN and CNN with
various network sizes.

Table 1. PSNR of different network cascading schemes on Set5,
evaluated for different scaling factors in each column.

upscale factor   ×1.5    ×2      ×3      ×4
SCN×1.5          40.14   36.41   30.33   29.02
SCN×2            40.15   36.93   32.99   30.70
SCN×3            39.88   36.76   32.87   30.63
SCN×4            39.69   36.54   32.76   30.55
CSCN             40.15   36.93   33.10   30.86

The performance of several networks during training is
measured on Set5 in Fig. 6. Our SCN improves significantly
over sparse coding (SC) [35], as it leverages data more
effectively with end-to-end training. The SCN initialized
according to (6) converges faster and to a better solution than the
same model with random initialization, which indicates that
the understanding of SCN based on sparse coding can help
its optimization. We also train a CNN model [8] of the
same size as SCN, but find its convergence speed much
slower. It is reported in [8] that training a CNN takes
8×10^8 back-propagations (equivalent to 12.5×10^6
mini-batches here). To achieve the same performance as CNN,
our SCN requires less than 1% of the back-propagations.

The network size of SCN is mainly determined by the
dictionary size $n$. Besides the default value $n$=128, we
have tried other sizes and plot their performance versus the
number of network parameters in Fig. 7. The PSNR of
SCN does not drop too much as $n$ decreases from 128 to
64, but the model size and computation time can be reduced
significantly. Fig. 7 also shows the performance of CNN
with various sizes. Our smallest SCN can achieve higher
PSNR than the largest model (CNN-L) in [?] while only
using about 20% of the parameters.

Different numbers of recurrent stages $k$ have been tested
for SCN, and we find increasing $k$ from 1 to 3 only improves
performance by less than 0.1 dB. As a tradeoff between
speed and accuracy, we use $k$=1 throughout the paper.

In Table 1, different network cascade structures (in each
row) are compared at different scaling factors (in each
column). SCN×a denotes the simple cascade of SCN with
fixed scaling factor a, where an individually trained SCN
is applied one or more times for scaling factors other than
a. It is observed that SCN×2 can perform as well as
the scale-specific model for the small scaling factor (1.5), and
much better for large scaling factors (3 and 4). Note that
the cascade of SCN×1.5 does not lead to good results since
artifacts quickly get amplified through many repetitive
upscalings. Therefore, we use SCN×2 as the default building
block for CSCN, and drop the notation ×2 when there is
no ambiguity. The last row in Table 1 shows that a CSCN
trained using the multi-scale objective in (5) can further
improve the SR results for scaling factors 3 and 4, as the
second SCN in the cascade is trained to be robust to the
artifacts generated by the first one.

6.2. Comparison with State of the Arts 

We compare the proposed CSCN with other recent SR
methods on all the images in Set5, Set14 and BSD100
for different upscaling factors. Table 2 shows the PSNR
and structural similarity (SSIM) [?] for adjusted anchored
neighborhood regression (A+) [29], CNN [8], CNN trained
with larger model size and more data (CNN-L) [?], the
proposed CSCN, and CSCN with our multi-view testing
(CSCN-MV). We do not list other methods [?]
whose performance is worse than A+ or CNN-L.

It can be seen from Table 2 that CSCN performs consistently
better than all previous methods in both PSNR and
SSIM, and with multi-view testing the results can be further
improved. CNN-L improves over CNN by increasing model
parameters and training data. However, it is still not as good
as CSCN which is trained with a much smaller size and on
a much smaller data set. Clearly, the better model structure
of CSCN makes it less dependent on model capacity and
training data in improving performance. Our models are
generally more advantageous for large scaling factors due
to the cascade structure.

The visual qualities of the SR results generated by sparse
coding (SC) [35], CNN and CSCN are compared in Fig. 8.
Our approach produces image patterns with sharper boundaries
and richer textures, and is free of the ringing artifacts
observable in the other two methods.

Fig. 9 shows the SR results on the “chip” image compared
among more methods including the self-example
based method (SE) [?] and the deep network cascade
(DNC) [6]. SE and DNC can generate very sharp edges
on this image, but also introduce artifacts and blurs on
corners and fine structures due to the lack of self-similar
patches. On the contrary, the CSCN method recovers all the
structures of the characters without any distortion.

We also compare CSCN with other sparse coding
extensions [22, ?, ?], and consider the blurring effect
introduced in downscaling. A PSNR gain of 0.3 to 1.6 dB
is achieved by CSCN in general. Experiment details and
source codes are available online.¹

¹ www.ifp.illinois.edu/~dingliu2/iccv15

Table 2. PSNR (SSIM) comparison on three test data sets among different methods. Red indicates the best and blue indicates the second
best performance. The performance gain of our best model over all the others’ best is shown in the last row.

Data Set    Set5                          Set14                         BSD100
Upscaling   ×2        ×3        ×4        ×2        ×3        ×4        ×2        ×3        ×4
A+ [29]     36.55     32.59     30.29     32.28     29.13     27.33     30.78     28.18     26.77
            (0.9544)  (0.9088)  (0.8603)  (0.9056)  (0.8188)  (0.7491)  (0.8773)  (0.7808)  (0.7085)
CNN [8]     36.34     32.39     30.09     32.18     29.00     27.20     31.11     28.20     26.70
            (0.9521)  (0.9033)  (0.8530)  (0.9039)  (0.8145)  (0.7413)  (0.8835)  (0.7794)  (0.7018)
CNN-L [?]   36.66     32.75     30.49     32.45     29.30     27.50     31.36     28.41     26.90
            (0.9542)  (0.9090)  (0.8628)  (0.9067)  (0.8215)  (0.7513)  (0.8879)  (0.7863)  (0.7103)
CSCN        36.93     33.10     30.86     32.56     29.41     27.64     31.40     28.50     27.03
            (0.9552)  (0.9144)  (0.8732)  (0.9074)  (0.8238)  (0.7578)  (0.8884)  (0.7885)  (0.7161)
CSCN-MV     37.14     33.26     31.04     32.71     29.55     27.76     31.54     28.58     27.11
            (0.9567)  (0.9167)  (0.8775)  (0.9095)  (0.8271)  (0.7620)  (0.8908)  (0.7910)  (...)

Figure 8. SR results given by SC [35] (first row), CNN [8] (second row) and our CSCN (third row). Images from left to right: the “monarch”
image upscaled by ×3; the “zebra” image upscaled by ×3; the “comic” image upscaled by ×3.

Figure 9. ... ×4 times using different methods.

Figure 10. Subjective SR quality scores for different methods
including bicubic, SC [35], SE [?], SER [34], CNN [8] and the
proposed CSCN. The score for ground truth result is 1.

6.3. Subjective Evaluation

We conducted a subjective evaluation of SR results
for several methods including bicubic, SC [35], SE [?],
self-example regression (SER) [34], CNN [8] and CSCN.
Ground truth HR images are also included when they are
available as references. Each of the participants in the
evaluation is shown a set of HR image pairs, which are
upscaled from the same LR images using two randomly
selected methods. For each pair, the subject needs to decide
which one is better in terms of perceptual quality.

We have a total of 270 participants giving 720 pairwise
comparisons over 6 images with different scaling factors.
Not every participant completed all the comparisons but
their partial responses are still useful. All the evaluation
results can be summarized into a 7×7 winning matrix for 7
methods (including ground truth), based on which we fit a
Bradley-Terry [?] model to estimate the subjective score for
each method so that they can be ranked.
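Fitting a Bradley-Terry model to a winning matrix can be sketched with the classical iterative (minorization-maximization) update. This is an illustrative fit, not necessarily the procedure used here: the function name, the iteration count, the choice of a reference method whose score is fixed to 1, and the toy counts are all assumptions. It also assumes every method has at least one win and one comparison.

```python
def fit_bradley_terry(wins, ref=0, n_iter=200):
    """Estimate Bradley-Terry strengths from a pairwise winning matrix.

    wins[i][j] is the number of times method i was preferred over method j.
    Scores are rescaled so that the reference method has score 1.
    """
    m = len(wins)
    p = [1.0] * m
    for _ in range(n_iter):
        updated = []
        for i in range(m):
            total_wins = sum(wins[i][j] for j in range(m) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(m) if j != i)
            updated.append(total_wins / denom)
        p = [v / updated[ref] for v in updated]
    return p


# Toy counts (hypothetical): method 0 beats method 1 in 3 of 4 comparisons.
two_way = fit_bradley_terry([[0, 3], [1, 0]])
three_way = fit_bradley_terry([[0, 4, 5], [2, 0, 4], [1, 2, 0]])
```

Under the model, the probability that method i is preferred over method j is p_i / (p_i + p_j), so in the two-method toy case the fitted ratio matches the observed 3:1 preference.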

Fig. 10 shows the estimated scores for the 6 SR methods
in our evaluation, with the score for the ground truth method
normalized to 1. As expected, all the SR methods have
much lower scores than ground truth, showing the great
challenge in the SR problem. The bicubic interpolation is
significantly worse than other SR methods. The proposed
CSCN method outperforms other previous state-of-the-art
methods by a large margin, demonstrating its superior
visual quality. It should be noted that the visual difference
between some image pairs is very subtle. Nevertheless,
the human subjects are able to perceive such difference
when seeing the two images side by side, and therefore
make consistent ratings. The CNN model becomes less
competitive in the subjective evaluation than it is in PSNR
comparison. This indicates that the visually appealing
image appearance produced by CSCN should be attributed
to the regularization from sparse representation, which
cannot be easily learned by merely minimizing reconstruction
error as in CNN.

7. Conclusions 

We propose a new model for image SR by combining
the strengths of sparse coding and deep networks, and achieve
considerable improvement over existing deep and shallow
SR models both quantitatively and qualitatively. Besides
producing good SR results, the domain knowledge in the
form of sparse coding can also benefit training speed and
model compactness. Furthermore, we propose a cascaded
network for better flexibility in scaling factors as well as
more robustness to artifacts.

In future work, we will apply the SCN model to other
problems where sparse coding can be useful. The interaction
between deep networks for low-level and high-level
vision tasks will also be explored.